Abstract: This paper considers the clustering problem for large
data sets. We propose an approach based on distributed optimization.
The clustering problem is formulated as an optimization problem of
maximizing the classification gain. We show that the optimization
problem can be reformulated and decomposed into small-scale
sub-optimization problems by using the Dantzig-Wolfe decomposition
method. Generally speaking, the Dantzig-Wolfe method can only be used
for convex optimization problems, where the duality gaps are zero.
Even though the optimization problem considered in this paper is
non-convex, we prove that the duality gap goes to zero as the problem
size goes to infinity. Therefore, the Dantzig-Wolfe method can be
applied here. In the proposed approach, the clustering problem is
solved iteratively by a group of computers coordinated by one center
processor, where each computer solves one independent small-scale
sub-optimization problem during each iteration, and only a small
amount of data communication is needed between the computers and the
center processor. Numerical results show that the proposed approach
is effective and efficient.



I. Introduction 

In recent years, due to the rapid progress of data acquisition and
communication technologies, it has become easy to collect and store
large amounts of data. Large databases of scientific measurements at
the scale of terabytes or even petabytes are frequently encountered
in high energy physics, astronomy, space exploration, and human
genome projects. Large databases of financial data and sales
transactions at the terabyte or petabyte scale are also common. These
huge amounts of data usually contain valuable scientific and business
information. For example, a large collection of sales transaction
data may contain important information about consumer behavior and
market trends. However, data analysis on such large databases
presents many technical challenges. The database size is usually far
larger than the memory size of any single computer, so many existing
centralized data analysis algorithms fail for these instances. In
fact, most data analysis problems for large databases are currently
open or not well solved.

In this paper, we consider one important data analysis problem, the
clustering problem for large databases. In the clustering problem, a
set of given data samples is classified into different groups, so
that the data samples within each group are similar according to
certain metrics. Clustering is a fundamental problem in data
analysis. It has many applications in pattern recognition, machine
learning, data mining, computer vision, and signal processing. For
example, clustering is an important step in many data mining
algorithms.

Many algorithms for clustering problems have been discussed in the
literature; see, for example, [1] and references therein. These
algorithms range from heuristic algorithms to statistical modeling
based algorithms. Among these, the statistical modeling based methods
generally have better clustering performance than other types of
algorithms, especially when the data clusters are not well separated.
The Expectation-Maximization (EM) algorithms with mixture Gaussian
modeling [2], [3] are the major state-of-the-art statistical modeling
based clustering algorithms. The EM algorithms can be considered as
iterative algorithms for computing the maximum likelihood estimate.
It has been proven that the likelihood function does not decrease
during the iterations.

However, it is well known that the EM algorithms have certain
limitations. First, according to previous experimental results, the
EM algorithms may converge very slowly [4], [5]. It is shown in [6]
that the EM algorithms are first-order optimization algorithms, which
provides a theoretical explanation for the slow convergence. In fact,
finding super-linear and second-order methods for the clustering
problems has been a long-standing open problem [7]. Second, the EM
algorithms do not converge and have numerical difficulties for
certain types of instances [4], [8]. For example, the EM algorithms
do not converge when the covariance matrices are singular. The EM
algorithms also do not converge when the number of components in the
mixture model is greater than the actual number of data clusters.

In addition, the standard EM algorithms require memory space
proportional to the database size and therefore do not scale well.
Various scaled-up versions of the standard EM algorithms have been
proposed in the literature [9], [10]. However, these previous
approaches are approximation algorithms. The accuracy of the obtained
results decreases as the ratio between the database size and the main
memory size increases.

In this paper, we propose a new clustering algorithm for large
databases based on data compression principles and mixture Gaussian
modeling. Following the approach in [11], we formulate the clustering
problem as an optimization problem. Instead of using a centralized
approach, we propose a distributed algorithm to solve the global
optimization problem. In our approach, the global optimization
problem is decomposed into small-scale sub-optimization problems
using the Dantzig-Wolfe decomposition method [12]. Generally
speaking, the Dantzig-Wolfe method can only be used in the convex
optimization case, where the duality gaps are zero. Even though the
problem considered in this paper is non-convex, we show that the
duality gap goes to zero as the problem size goes to infinity.
Therefore, the Dantzig-Wolfe method can be applied here. Our
algorithm is especially suitable for distributed databases, where
data are stored at multiple hosts or even at different geographical
locations. The global optimal solutions can be computed with only
intra-host computations, intra-host local database queries, and a
small amount of inter-host communication. Unlike many clustering
algorithms for large databases, which compute approximate solutions,
our algorithm computes exact solutions. Numerical results show that
the proposed algorithm does not have numerical difficulties when the
covariance matrices are singular. Numerical results also show that
the algorithm converges quickly.

The rest of this paper is organized as follows. We present the
proposed algorithm in Section II. We prove that the duality gap
vanishes for sufficiently large databases in Section III. Numerical
results are presented in Section IV. We present concluding remarks
in Section V.

Notation: We use boldface lower-case letters and boldface capital
letters to denote column vectors and matrices, respectively. For
example, we use $\mathbf{a}$ to denote a column vector, and $a(d)$ to
denote the $d$-th element of the vector $\mathbf{a}$. We use $A^T$ to
denote the transpose of the matrix $A$. We use $H(p_1, \ldots, p_J)$
to denote the entropy function,

$$H(p_1, \ldots, p_J) = \sum_{i=1}^{J} -p_i \log(p_i). \tag{1}$$

We use $\log(x)$ to denote the natural logarithm of the number $x$.
We use $\det(A)$ to denote the determinant of the matrix $A$.

II. Clustering Algorithm 

In this paper, we consider a data set consisting of $N$ data samples,
$x_1, x_2, \ldots, x_N$, where each data sample is a $D$-dimensional
vector. We assume that the data samples are randomly distributed
according to a mixture Gaussian distribution. That is,

$$P(x_n) = \sum_{i=1}^{J} p_i \frac{1}{(2\pi)^{D/2} \det(\Sigma_i)^{1/2}} \exp\left\{ -\frac{1}{2} (x_n - \mu_i)^T \Sigma_i^{-1} (x_n - \mu_i) \right\}. \tag{2}$$

Alternatively, we may consider $x_1, \ldots, x_n, \ldots, x_N$ as a
mixture of data samples from $J$ information sources, where each
information source is Gaussian distributed. The problem considered is
therefore to estimate the membership of each data sample in one of
the $J$ information sources, and also the probability distribution of
each information source.
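As a concrete reading of the mixture model in Eqn. (2), the following minimal sketch evaluates the density at one point. The function name `mixture_gaussian_density` and its argument layout are illustrative assumptions and do not come from the paper; NumPy is assumed to be available.

```python
import numpy as np

def mixture_gaussian_density(x, weights, means, covs):
    """Evaluate the mixture Gaussian density of Eqn. (2) at a single point x.

    weights: the p_i; means: the mu_i; covs: the Sigma_i (non-singular).
    """
    x = np.asarray(x, dtype=float)
    dim = x.shape[0]
    total = 0.0
    for p_i, mu_i, sigma_i in zip(weights, means, covs):
        sigma_i = np.asarray(sigma_i, dtype=float)
        diff = x - np.asarray(mu_i, dtype=float)
        # Quadratic form (x - mu)^T Sigma^{-1} (x - mu), via a linear solve.
        quad = diff @ np.linalg.solve(sigma_i, diff)
        norm = np.sqrt((2.0 * np.pi) ** dim * np.linalg.det(sigma_i))
        total += p_i * np.exp(-0.5 * quad) / norm
    return total
```

With a single one-dimensional standard Gaussian component, the value at the origin reduces to $1/\sqrt{2\pi}$, which is a quick sanity check of the normalization.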

In this paper, we propose a distributed algorithm for the above
clustering problem. Our algorithm is efficient when the data set
contains a large number of data samples. The data samples can be
stored at multiple computers or database hosts. The proposed
algorithm formulates the clustering problem as an optimization
problem and decomposes the optimization problem into multiple
small-scale sub-optimization problems. Each sub-optimization problem
can be solved at one database host using only locally stored data
samples. A center processor coordinates the computation at the
database hosts. The final solution is obtained from the
sub-optimization results. A diagram of the system is shown in Fig. 1.

[Figure: a center processor connected to the database hosts.]

Fig. 1. The diagram of the system.

The algorithm in this paper builds on the data compression based
clustering algorithm in [11]. The main idea behind the algorithm is
that optimal data clustering should induce optimal adaptive data
compression. That is, if we partition the data set into several
clusters and use one data compression encoder for each cluster, then
the optimal compression performance is achieved if each cluster
contains only the data samples from one information source. The
algorithm in [11] then formulates the data clustering problem as an
optimization problem, where the classification gain is maximized. The
classification gain is a measure of data compression efficiency
previously proposed in the data compression literature [13].

If the covariance matrices of all clusters are non-singular, then the
classification gain is inversely proportional to the following
function,

$$2H(p_1, \ldots, p_J) + \sum_{i=1}^{J} p_i \log(\det(\Sigma_i)) \tag{3}$$

where $p_i$ is the fraction of data samples in the $i$-th cluster,
and $\Sigma_i$ is the covariance matrix of the $i$-th cluster. The
above function is the objective function in our optimization
formulation. In the sequel, we always assume, without loss of
generality, that the covariance matrices of all clusters are
non-singular. If any covariance matrix is singular, we can instead
minimize the following function in the algorithm,

$$2H(p_1, \ldots, p_J) + \sum_{i=1}^{J} p_i \log(\det(\Sigma_i + \sigma^2 I_D)) \tag{4}$$

where $\sigma^2$ is a sufficiently small positive number, and $I_D$
is the $D$-dimensional identity matrix. This is equivalent to adding
white noise with covariance matrix $\sigma^2 I_D$ to the data samples
and clustering the noise-corrupted data samples instead. The
optimality of the final clustering results is not much affected if
$\sigma^2$ is small enough.
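The objective in Eqns. (3) and (4) can be evaluated directly once the cluster fractions and covariance matrices are known. The sketch below is a minimal illustration under stated assumptions: the names `entropy`, `clustering_objective`, and `noise_var` (standing for the small noise variance in Eqn. (4)) are hypothetical, and the log-determinant is computed with NumPy's `slogdet` for numerical stability.

```python
import numpy as np

def entropy(p):
    """Entropy H(p_1, ..., p_J) of Eqn. (1), with natural logarithms."""
    p = np.asarray(p, dtype=float)
    nz = p[p > 0]                      # 0 * log(0) is treated as 0
    return float(-np.sum(nz * np.log(nz)))

def clustering_objective(p, covs, noise_var=0.0):
    """2 H(p) + sum_i p_i log det(Sigma_i + noise_var * I_D).

    noise_var = 0 gives Eqn. (3); noise_var > 0 gives Eqn. (4).
    """
    dim = np.asarray(covs[0]).shape[0]
    value = 2.0 * entropy(p)
    for p_i, sigma_i in zip(p, covs):
        reg = np.asarray(sigma_i, dtype=float) + noise_var * np.eye(dim)
        _, logdet = np.linalg.slogdet(reg)
        value += p_i * logdet
    return value
```

For a singular covariance matrix the unregularized log-determinant is minus infinity, while any positive `noise_var` keeps the objective finite, which mirrors the role of the added white noise described above.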
The proposed algorithm formulates the clustering problem as an
optimization problem. We introduce a variable $a_{ni}$ for each $n$,
$i$, with $1 \le n \le N$ and $1 \le i \le J$. The variable $a_{ni}$
is the likelihood that the $n$-th data sample belongs to the $i$-th
information source. The mean $\mu_i$, covariance matrix $\Sigma_i$,
and occurrence probability $p_i$ are functions of the likelihood
variables $a_{ni}$,



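$$\mu_i = \frac{1}{\sum_{n=1}^{N} a_{ni}} \sum_{n=1}^{N} a_{ni} x_n \tag{5}$$

$$\Sigma_i = \frac{1}{\sum_{n=1}^{N} a_{ni}} \sum_{n=1}^{N} a_{ni} (x_n - \mu_i)(x_n - \mu_i)^T \tag{6}$$

$$p_i = \frac{\sum_{n=1}^{N} a_{ni}}{N} \tag{7}$$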
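The weighted statistics in Eqns. (5)-(7) can be computed in a few lines. The following is a minimal sketch, assuming the samples are stored as an N-by-D array and the likelihoods as an N-by-J array whose columns all have positive total mass; the function name `cluster_parameters` is an illustrative assumption.

```python
import numpy as np

def cluster_parameters(x, a):
    """Compute p_i, mu_i, Sigma_i of Eqns. (5)-(7).

    x: (N, D) array of data samples; a: (N, J) array of likelihoods a_{ni}.
    Assumes every column of a has a positive sum.
    """
    x = np.asarray(x, dtype=float)
    a = np.asarray(a, dtype=float)
    n_samples = x.shape[0]
    mass = a.sum(axis=0)                   # sum_n a_{ni}, shape (J,)
    p = mass / n_samples                   # Eqn. (7)
    mu = (a.T @ x) / mass[:, None]         # Eqn. (5), shape (J, D)
    sigma = []
    for i in range(a.shape[1]):
        diff = x - mu[i]                   # (N, D) deviations from mu_i
        sigma.append((a[:, i, None] * diff).T @ diff / mass[i])  # Eqn. (6)
    return p, mu, np.stack(sigma)
```

With 0/1 likelihoods the result reduces to the ordinary per-cluster fraction, sample mean, and (biased) sample covariance.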



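The formulated optimization problem is therefore,

$$\min_{a} \left\{ 2H(p_1, \ldots, p_J) + \sum_{i=1}^{J} p_i \log(\det(\Sigma_i)) \right\}$$

$$\text{Subject to: } a \in \Omega \tag{8}$$

where $a$ is a vector obtained by stacking all the variables $a_{ni}$,
and

$$\Omega = \left\{ a : \sum_{i=1}^{J} a_{ni} = 1,\ 0 \le a_{ni} \le 1 \right\} \tag{9}$$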



The final estimation results can be obtained by randomly rounding the
optimal solution $a^*_{ni}$ of the above optimization problem as in
[11]. The near-optimality of this optimization based approach has
been shown in [11] and [14].
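The exact rounding procedure is specified in [11] and is not reproduced here. One plausible reading, shown as a sketch only, is to draw each sample's label independently with probabilities given by its row of likelihoods; the function name `randomized_rounding` and this interpretation are assumptions.

```python
import numpy as np

def randomized_rounding(a, rng):
    """Draw one hard cluster label per sample with P(label_n = i) = a[n, i].

    a: (N, J) array whose rows are non-negative and sum to one.
    rng: a numpy.random.Generator instance.
    """
    a = np.asarray(a, dtype=float)
    cdf = np.cumsum(a, axis=1)
    u = rng.random(a.shape[0])[:, None]    # one uniform draw in [0, 1) per sample
    # The label is the number of cumulative thresholds that u has passed.
    labels = (u >= cdf).sum(axis=1)
    # Guard against round-off leaving the last cumulative value just below 1.
    return np.minimum(labels, a.shape[1] - 1)
```

Rows that are already 0/1 are mapped to their cluster deterministically, so an integral optimal solution is left unchanged by this step.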

In the sequel, we show that the optimization problem in Eqn. (8) can
be reduced to sub-optimization problems that can be solved locally at
each database host. The reduction and reformulation procedure
consists of four steps.

In the first step of reformulating the problem, we first solve the
restricted optimization problem with the $p_i$ fixed,

$$g(p_1, \ldots, p_J)^* = \min_{a} \left\{ 2H(p_1, \ldots, p_J) + \sum_{i=1}^{J} p_i \log(\det(\Sigma_i)) \right\}$$

$$\text{Subject to: } a \in \Omega, \text{ and } \sum_{n=1}^{N} a_{ni} = p_i N, \text{ for all } i. \tag{10}$$

We then optimize over $p_1, \ldots, p_J$ to find the overall optimal
solution,

$$\min_{p_1, \ldots, p_J} g(p_1, \ldots, p_J)^*,$$

$$\text{Subject to: } \sum_{i=1}^{J} p_i = 1,\ 0 \le p_i \le 1. \tag{11}$$

The problem in Eqn. (11) can be easily solved by using the gradient
descent approach. The main problem is therefore reduced to the
optimization problem in Eqn. (10).

In the second step of reformulating the problem, we introduce
auxiliary unitary matrices $A_1, \ldots, A_J$. We define
$B_i = A_i \Sigma_i A_i^T$, for $i = 1, \ldots, J$. It can be shown
that the optimization problem in Eqn. (10) is equivalent to the
following optimization problem.

$$\min_{a, A_1, \ldots, A_J} \sum_{i=1}^{J} p_i \sum_{d=1}^{D} \log(\sigma_{id}^2),$$

$$\text{Subject to: } a \in \Omega,\ A_1, \ldots, A_J \text{ are unitary},\ \sum_{n=1}^{N} a_{ni} = p_i N \tag{12}$$

where $\sigma_{id}^2$ is the $d$-th diagonal element of the matrix
$B_i$. The two optimization problems are equivalent, because

$$\sum_{i=1}^{J} p_i \log(\det(\Sigma_i)) \le \sum_{i=1}^{J} p_i \sum_{d=1}^{D} \log(\sigma_{id}^2) \tag{13}$$

due to the Hadamard inequality [15, page 502, Thm. 16.8.2], and
clearly equality can be achieved by certain $A_1, \ldots, A_J$.
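The inequality in Eqn. (13) and its equality case can be checked numerically for a single cluster. The sketch below uses a randomly generated positive definite matrix in place of a covariance matrix and real orthogonal matrices in place of the unitary matrices; all variable names and the specific random construction are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
m = rng.standard_normal((3, 3))
sigma = m @ m.T + 0.1 * np.eye(3)            # a positive definite stand-in for Sigma_i

# A random orthogonal (real unitary) matrix obtained from a QR factorization.
q, _ = np.linalg.qr(rng.standard_normal((3, 3)))
b = q @ sigma @ q.T                          # B_i = A_i Sigma_i A_i^T
logdet = np.linalg.slogdet(sigma)[1]         # log det(Sigma_i)
bound_random = np.sum(np.log(np.diag(b)))    # sum_d log(sigma_{id}^2)

# Taking the rows of A_i as eigenvectors of Sigma_i makes B_i diagonal,
# which is the case where the Hadamard inequality holds with equality.
_, vecs = np.linalg.eigh(sigma)
a_opt = vecs.T
bound_opt = np.sum(np.log(np.diag(a_opt @ sigma @ a_opt.T)))
```

Here `logdet` never exceeds `bound_random`, and it matches `bound_opt` up to floating-point error, which is why minimizing over the unitary matrices recovers the log-determinant objective.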

We solve the optimization problem in Eqn. (12) by an alternating
optimization approach. That is, we iteratively first fix
$A_1, \ldots, A_J$ and optimize over $a$, and then fix $a$ and
optimize over $A_1, \ldots, A_J$. The latter optimization problem is
easy to solve, because the optimal $A_1, \ldots, A_J$ are clearly the
matrices such that each $B_i$ becomes diagonal. The main optimization
problem is therefore reduced to

$$\min_{a} \sum_{i=1}^{J} \sum_{d=1}^{D} p_i \log(\sigma_{id}^2)$$

$$\text{Subject to: } a \in \Omega,\ \sum_{n=1}^{N} a_{ni} = p_i N, \tag{14}$$

where $A_1, \ldots, A_J$ are fixed and given.

In the third step of reformulating the problem, we use an iterative
upper bounding and minimizing approach to solve the optimization
problem in Eqn. (14). Let $\sigma_{id}^2[t]$ denote the solution
obtained in the $t$-th iteration. Note that the objective function in
Eqn. (14) can be upper bounded as follows, due to the fact that the
objective function is concave with respect to $\sigma_{id}^2$.

$$\sum_{i=1}^{J} \sum_{d=1}^{D} p_i \log(\sigma_{id}^2) \le \sum_{i=1}^{J} \sum_{d=1}^{D} p_i \log(\sigma_{id}^2[t]) + \sum_{i=1}^{J} \sum_{d=1}^{D} \frac{p_i}{\sigma_{id}^2[t]} \left( \sigma_{id}^2 - \sigma_{id}^2[t] \right) \tag{15}$$

In the $(t+1)$-th iteration, we find a solution $a$ such that the
corresponding $\sigma_{id}^2$ minimizes the above upper bound. Because
the bound is tight at $\sigma_{id}^2[t]$, the objective function
never increases during the iterations. Therefore, the main
optimization problem is reduced to the following optimization
problem.

$$\min_{a} \left\{ \sum_{i=1}^{J} \sum_{d=1}^{D} p_i \beta_{id} \sigma_{id}^2 \right\}$$

$$\text{Subject to: } a \in \Omega,\ \sum_{n=1}^{N} a_{ni} = p_i N, \tag{16}$$

where $\beta_{id} = 1/\sigma_{id}^2[t]$.
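The upper bound in Eqn. (15) is the tangent-line bound of the concave logarithm, and it is tight at the current iterate. The following sketch checks both properties on small hypothetical arrays; the function names `surrogate` and `objective` and the numerical values are illustrative assumptions, not values from the paper.

```python
import numpy as np

def objective(s, p):
    """Left-hand side of Eqn. (15): sum_i sum_d p_i log(sigma_{id}^2)."""
    return np.sum(p * np.log(s))

def surrogate(s, s_t, p):
    """Right-hand side of Eqn. (15): tangent-line upper bound built at s_t."""
    return np.sum(p * np.log(s_t)) + np.sum(p / s_t * (s - s_t))

p = np.array([[0.3], [0.7]])                 # p_i, broadcast over the index d
s_t = np.array([[2.0, 0.5], [1.0, 4.0]])     # hypothetical sigma_{id}^2[t]
s_new = np.array([[1.5, 0.4], [0.9, 3.0]])   # hypothetical next iterate
```

Dropping the terms of the surrogate that do not depend on the new iterate leaves the weighted sum with weights $p_i / \sigma_{id}^2[t]$, which is the objective of Eqn. (16).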
In the fourth and final step of reformulating the problem, we
decompose the optimization problem in Eqn. (16) into sub-optimization
problems by using the Dantzig-Wolfe decomposition method. Each
sub-optimization problem can be solved locally at one database host.
The Dantzig-Wolfe decomposition method was initially introduced for
linear programming problems [12]. The method was then generalized to
the convex optimization case, where the duality gaps are zero (see,
for example, [16] and references therein). For non-convex
optimization problems, the decomposition method generally cannot be
applied due to the non-zero duality gaps. Even though the
optimization problem in Eqn. (16) is not convex, we show in Theorem
3.6 that the duality gap goes to zero as the number of data samples
$N$ goes to infinity. Therefore, the decomposition method can be
applied here.

Let us assume that the data samples $x_1, \ldots, x_n, \ldots, x_N$
are stored at $K$ database hosts. Let $\mathcal{N}_k$ denote the set
of the indexes of the data samples stored at the $k$-th host. We use
$A_i x_n(d)$ to denote the $d$-th element of the vector $A_i x_n$.
The optimization problem in Eqn. (16) is equivalent to the following
optimization problem.

$$\min_{a, \bar{\mu}} \sum_{i=1}^{J} \sum_{k=1}^{K} \sum_{n \in \mathcal{N}_k} \sum_{d=1}^{D} \frac{a_{ni}}{N} \beta_{id} \left( A_i x_n(d) - \bar{\mu}_{ikd} \right)^2$$

Subject to:

$$\sum_{i=1}^{J} a_{ni} = 1,\quad 0 \le a_{ni} \le 1,\quad \sum_{n=1}^{N} a_{ni}/N = p_i,$$

$$\bar{\mu}_{ikd} = \frac{1}{p_i N} \sum_{n=1}^{N} a_{ni} A_i x_n(d), \tag{17}$$

where $\bar{\mu}$ is the vector obtained by stacking all the
variables $\bar{\mu}_{ikd}$. The real number $\bar{\mu}_{ikd}$ can be
considered as a local guess or estimate of the mean of $A_i x_n(d)$
at the $k$-th database host. If all the local guesses are equal, then
the above objective function is equal to the objective function in
Eqn. (16).

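Because the duality gap is approximately zero, as proven in Theorem
3.6, the optimization problem in Eqn. (17) is approximately
equivalent to its Lagrangian dual problem as follows.

$$\max_{\lambda} \min_{a, \bar{\mu}} \Bigg\{ \sum_{i=1}^{J} \sum_{k=1}^{K} \sum_{n \in \mathcal{N}_k} \sum_{d=1}^{D} \frac{a_{ni}}{N} \beta_{id} \left( A_i x_n(d) - \bar{\mu}_{ikd} \right)^2 + \sum_{i=1}^{J} \sum_{k=1}^{K} \sum_{d=1}^{D} \lambda_{ikd} \left( \bar{\mu}_{ikd} - \frac{1}{p_i N} \sum_{n=1}^{N} a_{ni} A_i x_n(d) \right) + \sum_{i=1}^{J} \lambda_{p_i} \left( \sum_{n=1}^{N} a_{ni}/N - p_i \right) \Bigg\}$$

$$\text{Subject to: } a \in \Omega, \tag{18}$$

where $\lambda$ denotes the vector obtained by stacking all variables
$\lambda_{ikd}$ and $\lambda_{p_i}$. The above optimization problem
is separable and can be rewritten as,

$$\max_{\lambda} \left\{ \sum_{k=1}^{K} f_k^* - \sum_{i=1}^{J} \lambda_{p_i} p_i \right\} \tag{19}$$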



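where each $f_k^*$ is the optimization result of one sub-optimization
problem. Let $a_k$ denote the vector obtained by stacking all
variables $a_{ni}$ with $n \in \mathcal{N}_k$. Let $\bar{\mu}_k$
denote the vector obtained by stacking all parameters
$\bar{\mu}_{ikd}$, $i = 1, \ldots, J$, $d = 1, \ldots, D$. Then

$$f_k^* = \min_{a_k, \bar{\mu}_k} \Bigg\{ \sum_{i=1}^{J} \sum_{n \in \mathcal{N}_k} \sum_{d=1}^{D} \frac{a_{ni}}{N} \beta_{id} \left( A_i x_n(d) - \bar{\mu}_{ikd} \right)^2 + \sum_{i=1}^{J} \sum_{d=1}^{D} \lambda_{ikd} \bar{\mu}_{ikd} + \sum_{i=1}^{J} \sum_{n \in \mathcal{N}_k} \lambda_{p_i} \frac{a_{ni}}{N} - \sum_{i=1}^{J} \sum_{k'=1}^{K} \sum_{d=1}^{D} \sum_{n \in \mathcal{N}_k} \frac{\lambda_{ik'd}}{p_i N} a_{ni} A_i x_n(d) \Bigg\}$$

$$\text{Subject to: } \sum_{i=1}^{J} a_{ni} = 1,\ 0 \le a_{ni} \le 1, \text{ for } n \in \mathcal{N}_k. \tag{20}$$

It can be checked that each $f_k^*$ can be computed locally at the
$k$-th database host using only information about the local data
samples $x_n$, $n \in \mathcal{N}_k$, with given parameters
$\beta_{id}$, $\lambda$, and $A_1, \ldots, A_J$.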
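One observation that may help when implementing the local problem in Eqn. (20): for fixed local guesses $\bar{\mu}_k$, the objective is linear in the likelihoods of the local samples, so the minimum over the per-sample simplex constraints is attained by assigning each sample entirely to its cheapest cluster. The sketch below covers only this inner step, not the joint minimization over $\bar{\mu}_k$; the function name `local_assignment` and the array layout are hypothetical.

```python
import numpy as np

def local_assignment(x_k, A, beta, mu_bar_k, lam, lam_p, p, n_total):
    """Minimize Eqn. (20) over a_k for FIXED local guesses mu_bar_k.

    x_k: (n_k, D) local samples; A: (J, D, D) unitary matrices A_i;
    beta: (J, D); mu_bar_k: (J, D) local guesses of host k;
    lam: (J, K, D) multipliers lambda_{ikd}; lam_p: (J,) multipliers
    lambda_{p_i}; p: (J,) fixed fractions p_i; n_total: total sample count N.
    Returns the (n_k, J) per-sample cost matrix and a 0/1 assignment.
    """
    proj = np.einsum('ide,ne->nid', A, x_k)           # A_i x_n(d), shape (n_k, J, D)
    quad = (beta * (proj - mu_bar_k) ** 2).sum(axis=2)
    # sum over k' and d of lambda_{ik'd} A_i x_n(d), divided by p_i
    lin = (lam.sum(axis=1) * proj).sum(axis=2) / p
    cost = (quad + lam_p - lin) / n_total             # coefficient of a_{ni}
    a_k = np.zeros_like(cost)
    a_k[np.arange(cost.shape[0]), cost.argmin(axis=1)] = 1.0
    return cost, a_k
```

The term $\sum_{i,d} \lambda_{ikd} \bar{\mu}_{ikd}$ does not depend on the likelihoods and is therefore omitted from the cost matrix. Only the local samples and the broadcast parameters enter the computation, which is consistent with the claim that each host works on locally stored data.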

Therefore, the proposed algorithm iteratively computes the clustering
result. During each iteration, each database host solves one local
small-scale optimization problem as in Eqn. (20). The center
processor then solves the global optimization problem as in Eqn. (19)
using the local optimization results. The global optimization problem
can be solved by using, for example, the subgradient method [16,
Section 6.3.1].

III. Vanishing Duality Gap 

In this section, we prove that the duality gap between the primal
optimization problem in Eqn. (17) and the dual optimization problem
in Eqn. (18) goes to zero as the problem size $N$ goes to infinity.
We need the Azuma inequality in our discussion. A proof of the
inequality can be found, for example, in [17], [18].

Lemma 3.1: (Azuma Inequality) Let $Z_1, \ldots, Z_N$ be independent
random variables, with $Z_k$ taking values in a set $A_k$. Assume
that a (measurable) function
$f : A_1 \times A_2 \times \cdots \times A_N \to \mathbb{R}$
satisfies the following Lipschitz condition (L).

• (L) If the vectors $z, z' \in \prod_{i} A_i$ differ only in the
$k$-th coordinate, then $|f(z) - f(z')| \le c_k$, $k = 1, \ldots, N$.

Then, the random variable $X = f(Z_1, \ldots, Z_N)$ satisfies, for
any $t \ge 0$,

$$P(X \ge EX + t) \le \exp\left\{ \frac{-2t^2}{\sum_{k=1}^{N} c_k^2} \right\} \tag{21}$$

$$P(X \le EX - t) \le \exp\left\{ \frac{-2t^2}{\sum_{k=1}^{N} c_k^2} \right\} \tag{22}$$
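The bound in Eqn. (21) can be illustrated with a small Monte Carlo experiment. The sketch below uses the sample mean of fair coin flips as the function $f$, for which each coordinate changes the value by at most $c_k = 1/N$; the specific choice of $N$, $t$, and the number of trials is an arbitrary assumption for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, t, trials = 100, 0.1, 20000

# f(Z_1, ..., Z_N) = mean of N independent values in {0, 1}; changing one
# coordinate moves f by at most c_k = 1/N, so sum_k c_k^2 = 1/N.
z = rng.integers(0, 2, size=(trials, n))
f = z.mean(axis=1)
empirical_tail = np.mean(f >= 0.5 + t)       # E f = 0.5 for fair coin flips
azuma_bound = np.exp(-2.0 * t ** 2 * n)      # exp(-2 t^2 / sum_k c_k^2)
```

The same choice $c_k = 1/N$ is what produces the factor $\exp(-2\epsilon^2 N)$ when the inequality is applied to averages of the likelihood variables later in this section.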



The basic idea is to use randomization. Randomization has been used
previously in establishing stronger duality theories. We refer
interested readers to [19] and references therein. Let
$p(a, \bar{\mu})$ denote the probability distribution of $a$ and
$\bar{\mu}$, where the range of $a$ is $\Omega$, and

$$\min_{n} A_i x_n(d) \le \bar{\mu}_{ikd} \le \max_{n} A_i x_n(d). \tag{23}$$
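We introduce the following randomized primal optimization problem.

$$\min_{p(a, \bar{\mu})} E\left\{ \sum_{i=1}^{J} \sum_{k=1}^{K} \sum_{n \in \mathcal{N}_k} \sum_{d=1}^{D} \frac{a_{ni}}{N} \beta_{id} \left( A_i x_n(d) - \bar{\mu}_{ikd} \right)^2 \right\}$$

Subject to:

$$E\left[ \bar{\mu}_{ikd} - \frac{1}{p_i N} \sum_{n=1}^{N} a_{ni} A_i x_n(d) \right] = 0,$$

$$E\left[ \frac{1}{N} \sum_{n=1}^{N} a_{ni} - p_i \,\Big|\, \bar{\mu} \right] = 0, \text{ for all } \bar{\mu}. \tag{24}$$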



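The corresponding Lagrangian randomized dual problem is

$$\max_{\lambda} \min_{p(a, \bar{\mu})} \Bigg\{ E\left\{ \sum_{i=1}^{J} \sum_{k=1}^{K} \sum_{n \in \mathcal{N}_k} \sum_{d=1}^{D} \frac{a_{ni}}{N} \beta_{id} \left( A_i x_n(d) - \bar{\mu}_{ikd} \right)^2 \right\} + \sum_{i=1}^{J} \int \lambda_{p_i}(\bar{\mu})\, E\left[ \frac{1}{N} \sum_{n=1}^{N} a_{ni} - p_i \,\Big|\, \bar{\mu} \right] d\bar{\mu} + \sum_{i=1}^{J} \sum_{k=1}^{K} \sum_{d=1}^{D} \lambda_{ikd}\, E\left[ \bar{\mu}_{ikd} - \frac{1}{p_i N} \sum_{n=1}^{N} a_{ni} A_i x_n(d) \right] \Bigg\} \tag{25}$$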



Let us denote the optimal values of the primal optimization problem
in Eqn. (17), the randomized primal optimization problem in Eqn.
(24), the dual optimization problem in Eqn. (18), and the randomized
dual optimization problem in Eqn. (25) by $P^*$, $PR^*$, $D^*$, and
$DR^*$, respectively. We have the following lemmas.

Lemma 3.2:

$$PR^* \le P^* \tag{26}$$

Proof: The lemma follows from the fact that each deterministic
variable can be considered as a random variable with a singleton
probability distribution. ■

Lemma 3.3:

$$DR^* \le D^* \tag{27}$$

Proof: Similar to the proof of Lemma 3.2. ■

Lemma 3.4:

$$PR^* = DR^* \tag{28}$$
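Proof: We may define the following $PR_\epsilon$ optimization
problem.

$$\min_{p(a, \bar{\mu})} E\left\{ \sum_{i=1}^{J} \sum_{k=1}^{K} \sum_{n \in \mathcal{N}_k} \sum_{d=1}^{D} \frac{a_{ni}}{N} \beta_{id} \left( A_i x_n(d) - \bar{\mu}_{ikd} \right)^2 \right\} \tag{29}$$

Subject to:

$$\left| E\left[ \frac{1}{N} \sum_{n=1}^{N} a_{ni} - p_i \,\Big|\, \bar{\mu} \right] \right| \le \epsilon, \text{ for all } \bar{\mu}, \tag{30}$$

$$\left| E\left[ \bar{\mu}_{ikd} - \frac{1}{p_i N} \sum_{n=1}^{N} a_{ni} A_i x_n(d) \right] \right| \le \epsilon. \tag{31}$$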


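It can be checked that $PR_\epsilon^* \le PR^*$, and
$PR_\epsilon^* \to PR^*$, as $\epsilon \to 0$. The dual of the
$PR_\epsilon$ problem, $DR_\epsilon$, is

$$\max_{\lambda} \min_{p(a, \bar{\mu})} \Bigg\{ E\left\{ \sum_{i=1}^{J} \sum_{k=1}^{K} \sum_{n \in \mathcal{N}_k} \sum_{d=1}^{D} \frac{a_{ni}}{N} \beta_{id} \left( A_i x_n(d) - \bar{\mu}_{ikd} \right)^2 \right\} + \sum_{i=1}^{J} \sum_{k=1}^{K} \sum_{d=1}^{D} \lambda^{+}_{ikd} \left( E\left[ \bar{\mu}_{ikd} - \frac{1}{p_i N} \sum_{n=1}^{N} a_{ni} A_i x_n(d) \right] - \epsilon \right) + \sum_{i=1}^{J} \sum_{k=1}^{K} \sum_{d=1}^{D} (-1) \lambda^{-}_{ikd} \left( E\left[ \bar{\mu}_{ikd} - \frac{1}{p_i N} \sum_{n=1}^{N} a_{ni} A_i x_n(d) \right] + \epsilon \right) + \sum_{i=1}^{J} \int \left[ \lambda^{+}_{p_i}(\bar{\mu}) \left( E\left[ \frac{1}{N} \sum_{n=1}^{N} a_{ni} - p_i \,\Big|\, \bar{\mu} \right] - \epsilon \right) - \lambda^{-}_{p_i}(\bar{\mu}) \left( E\left[ \frac{1}{N} \sum_{n=1}^{N} a_{ni} - p_i \,\Big|\, \bar{\mu} \right] + \epsilon \right) \right] d\bar{\mu} \Bigg\}$$

$$\text{Subject to: } \lambda^{-}_{ikd} \ge 0,\ \lambda^{+}_{ikd} \ge 0,\ \lambda^{-}_{p_i}(\bar{\mu}) \ge 0,\ \lambda^{+}_{p_i}(\bar{\mu}) \ge 0. \tag{32}$$

It can also be checked that $DR_\epsilon^* \to DR^*$, as
$\epsilon \to 0$.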

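Now we show that $PR_\epsilon$ is a convex optimization problem. Let
$p^1(a, \bar{\mu})$, $p^2(a, \bar{\mu})$ be two probability
distributions satisfying all the constraints in the $PR_\epsilon$
problem. Let

$$p(a, \bar{\mu}) = \alpha p^1(a, \bar{\mu}) + (1 - \alpha) p^2(a, \bar{\mu}), \tag{33}$$

where $0 \le \alpha \le 1$. Equivalently, we may introduce a random
variable $z$, $P(z = 1) = \alpha$, $P(z = 2) = 1 - \alpha$;
$p(a, \bar{\mu}) = p^1(a, \bar{\mu})$, if $z = 1$, and
$p(a, \bar{\mu}) = p^2(a, \bar{\mu})$, if $z = 2$. We can show that
$p(a, \bar{\mu})$ satisfies the constraint in Eqn. (30) as follows.

$$\begin{aligned}
E\left[ \frac{1}{N} \sum_{n=1}^{N} a_{ni} - p_i \,\Big|\, \bar{\mu} \right]
&= \int \left( \frac{1}{N} \sum_{n=1}^{N} a_{ni} - p_i \right) p(a | \bar{\mu})\, da \\
&= \int \left( \frac{1}{N} \sum_{n=1}^{N} a_{ni} - p_i \right) p(a, z = 1 | \bar{\mu})\, da + \int \left( \frac{1}{N} \sum_{n=1}^{N} a_{ni} - p_i \right) p(a, z = 2 | \bar{\mu})\, da \\
&= \int \left( \frac{1}{N} \sum_{n=1}^{N} a_{ni} - p_i \right) p^1(a | \bar{\mu})\, p(z = 1 | \bar{\mu})\, da + \int \left( \frac{1}{N} \sum_{n=1}^{N} a_{ni} - p_i \right) p^2(a | \bar{\mu})\, p(z = 2 | \bar{\mu})\, da \\
&\le p(z = 1 | \bar{\mu})\, \epsilon + p(z = 2 | \bar{\mu})\, \epsilon \le \epsilon
\end{aligned} \tag{34}$$

Similarly,

$$E\left[ \frac{1}{N} \sum_{n=1}^{N} a_{ni} - p_i \,\Big|\, \bar{\mu} \right] \ge -\epsilon \tag{35}$$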



We can also show that $p(a, \bar{\mu})$ satisfies the constraint in
Eqn. (31) by using the fact that the expectation is a linear
functional. Finally, the objective function in Eqn. (29) is also
convex, because the expectation is a linear functional. Therefore,
the optimization problem $PR_\epsilon$ is a convex optimization
problem.

Because $PR_\epsilon$ is a convex optimization problem and the Slater
condition holds, $PR_\epsilon^* = DR_\epsilon^*$ according to the
strong duality theorem [20, Thm. 6.7]. Therefore, letting
$\epsilon \to 0$, $PR^* = DR^*$. ■
Lemma 3.5: Assume $\max_{n,m} \|x_n - x_m\|_2 \le V$, for a fixed
upper bound $V$, where $\|\cdot\|_2$ denotes the Euclidean norm. Then
$PR^* \to P^*$, as $N$ goes to infinity.

Proof: Let $p^*(a, \bar{\mu})$ denote the optimal solution of the
randomized primal problem. We can construct a probability
distribution $p(a, \bar{\mu})$ as follows.

$$p(a, \bar{\mu}) = p^*(\bar{\mu}) \prod_{n=1}^{N} p^*(a_{n1}, \ldots, a_{nJ} | \bar{\mu}), \tag{36}$$

where the probability distributions on the right-hand side are
marginal distributions. It can be checked that the probability
distribution $p(a, \bar{\mu})$ achieves exactly the same objective
function and constraint function values in the randomized primal
problem as the probability distribution $p^*(a, \bar{\mu})$.
Therefore, we can assume that $p^*(a, \bar{\mu})$ takes the form in
Eqn. (36) without loss of generality.

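We define the typical set $T(\epsilon)$ as

$$T(\epsilon) = \left\{ (a, \bar{\mu}) : \left| \sum_{n=1}^{N} \frac{a_{ni}}{N} - p_i \right| \le \epsilon, \text{ for all } i \right\} \tag{37}$$

The probability that $(a, \bar{\mu})$ is not in the typical set
$T(\epsilon)$ can be upper bounded by using the Azuma inequality and
the union bound as follows.

$$\begin{aligned}
P\left[ (a, \bar{\mu}) \notin T(\epsilon) \right]
&\le \sum_{i=1}^{J} P\left[ \left| \sum_{n=1}^{N} \frac{a_{ni}}{N} - p_i \right| > \epsilon \right] \\
&\le \sum_{i=1}^{J} \int P\left[ \left| \sum_{n=1}^{N} \frac{a_{ni}}{N} - p_i \right| > \epsilon \,\Big|\, \bar{\mu} \right] p(\bar{\mu})\, d\bar{\mu} \\
&\le \sum_{i=1}^{J} \int 2 \exp(-2\epsilon^2 N)\, p(\bar{\mu})\, d\bar{\mu} \\
&\le 2J \exp(-2\epsilon^2 N)
\end{aligned} \tag{38}$$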



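Due to the fact that the objective function is non-negative, the
average objective function value achieved by $(a, \bar{\mu})$ in the
typical set satisfies

$$E\left\{ \sum_{i=1}^{J} \sum_{k=1}^{K} \sum_{n \in \mathcal{N}_k} \sum_{d=1}^{D} \frac{a_{ni}}{N} \beta_{id} \left( A_i x_n(d) - \bar{\mu}_{ikd} \right)^2 \,\Big|\, T(\epsilon) \right\} \le \frac{PR^*}{P\left( (a, \bar{\mu}) \in T(\epsilon) \right)} \tag{39}$$

Also by the above discussion,

$$P\left[ (a, \bar{\mu}) \in T(\epsilon) \right] \ge 1 - 2J \exp(-2\epsilon^2 N) \tag{40}$$

Therefore, the average of the objective function in the typical set
is bounded by

$$E\left\{ \sum_{i=1}^{J} \sum_{k=1}^{K} \sum_{n \in \mathcal{N}_k} \sum_{d=1}^{D} \frac{a_{ni}}{N} \beta_{id} \left( A_i x_n(d) - \bar{\mu}_{ikd} \right)^2 \,\Big|\, T(\epsilon) \right\} \le \frac{PR^*}{1 - 2J \exp(-2\epsilon^2 N)} \tag{41}$$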



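There must exist one $(a, \bar{\mu})$ in the typical set such that
the corresponding objective function is less than or equal to the
above average. We can further modify the above $a$ into a certain
$\tilde{a} \in \Omega$, $\tilde{a} = (\ldots, \tilde{a}_{ni}, \ldots)$,
such that

$$\sum_{n=1}^{N} \tilde{a}_{ni}/N = p_i, \tag{42}$$

and the corresponding objective function is raised by at most
$(J - 1) \max\{\beta_{id}\} V^2 \epsilon$. We can now set

$$\tilde{\mu}_{ikd} = \frac{1}{p_i N} \sum_{n=1}^{N} \tilde{a}_{ni} A_i x_n(d). \tag{43}$$

Clearly, $\tilde{a}_{ni}$ and $\tilde{\mu}_{ikd}$ satisfy all the
constraints in the primal problem. Therefore,

$$\begin{aligned}
P^* &\le \sum_{i=1}^{J} \sum_{k=1}^{K} \sum_{n \in \mathcal{N}_k} \sum_{d=1}^{D} \frac{\tilde{a}_{ni}}{N} \beta_{id} \left( A_i x_n(d) - \tilde{\mu}_{ikd} \right)^2 \\
&\overset{(a)}{\le} \sum_{i=1}^{J} \sum_{k=1}^{K} \sum_{n \in \mathcal{N}_k} \sum_{d=1}^{D} \frac{\tilde{a}_{ni}}{N} \beta_{id} \left( A_i x_n(d) - \bar{\mu}_{ikd} \right)^2 \\
&\le \frac{PR^*}{1 - 2J \exp(-2\epsilon^2 N)} + (J - 1) \max\{\beta_{id}\} V^2 \epsilon
\end{aligned} \tag{44}$$

where (a) follows from the fact that the $\tilde{\mu}_{ikd}$ are the
minimizers of the above quadratic function. The lemma then follows
from the fact that $PR^* \le P^*$. ■

Theorem 3.6: The duality gap $P^* - D^*$ between the primal problem
and the dual problem goes to zero as the number of data samples $N$
goes to infinity.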
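One way to see how the theorem follows from the lemmas, given here as a sketch rather than as the original proof, is to chain the inequalities. It assumes weak duality $D^* \le P^*$ for the Lagrangian dual pair in Eqns. (17) and (18), and the bounded-data assumption of Lemma 3.5.

```latex
\begin{align*}
0 \;\le\; P^* - D^*
  &\;\le\; P^* - DR^* && \text{(Lemma 3.3: } DR^* \le D^*\text{)} \\
  &\;=\; P^* - PR^*   && \text{(Lemma 3.4: } PR^* = DR^*\text{)} \\
  &\;\to\; 0 \quad \text{as } N \to \infty && \text{(Lemma 3.5: } PR^* \to P^*\text{)}
\end{align*}
```

Lemma 3.2 guarantees that the last difference is non-negative, so the gap is squeezed between zero and a vanishing quantity.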

IV. Numerical Results 

In this section, we present numerical results for the proposed
clustering algorithm. In Fig. 2, we depict the result of the proposed
algorithm for the case of two overlapping clusters in a
two-dimensional space. Both clusters have zero mean. Their covariance
matrices are as follows.

$$\begin{bmatrix} 80000 & 52000 \\ 52000 & 35600 \end{bmatrix}, \qquad \begin{bmatrix} 192800 & -118800 \\ -118800 & 74000 \end{bmatrix} \tag{45}$$

The total number of data samples is 2048, and each cluster contains
1024 data samples. We assume that the data samples are observed by
two database hosts, where the first database host can only observe
the 1024 data samples from the first cluster, and the second database
host can only observe the 1024 data samples from the second cluster.
After the clustering result is obtained, we randomly select 128 data
samples from the first cluster and 128 data samples from the second
cluster and plot these data samples in the figure. The data samples
classified into one cluster are plotted as red circles, and the data
samples classified into the other cluster are plotted as blue
squares. The percentage of misclassified data samples is 5.32%. The
clustering errors mainly occur in the regions where the two clusters
overlap. The algorithm starts with two randomly selected unitary
matrices $A_1$ and $A_2$. We observe that these matrices converge
quickly. We also experiment with the cases where each database host
observes a mixture of data samples from the two clusters with various
percentages. The obtained results are not significantly different
from the result in Fig. 2.

Fig. 2. Clustering results for two overlapping clusters.
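For readers who want to reproduce a comparable setup, the data for the first experiment can be generated along the following lines. This is a sketch under assumptions: the random seed, the variable names, and the dictionary of hosts are hypothetical, and the paper does not state how its samples were drawn beyond the means, covariance matrices, and sample counts.

```python
import numpy as np

rng = np.random.default_rng(0)
cov_1 = np.array([[80000.0, 52000.0], [52000.0, 35600.0]])
cov_2 = np.array([[192800.0, -118800.0], [-118800.0, 74000.0]])

# 1024 zero-mean samples per cluster, matching the sample counts in the text.
x_1 = rng.multivariate_normal(np.zeros(2), cov_1, size=1024)
x_2 = rng.multivariate_normal(np.zeros(2), cov_2, size=1024)

# Each hypothetical database host observes the samples of one cluster only.
hosts = {"host_1": x_1, "host_2": x_2}
x_all = np.vstack([x_1, x_2])
true_labels = np.repeat([0, 1], 1024)
```

Keeping `true_labels` alongside the pooled samples makes it straightforward to compute a misclassification percentage for any clustering output, up to a relabeling of the two clusters.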



In Fig. 3, we depict the result of the proposed algorithm for the
case of two overlapping clusters with one cluster having a singular
covariance matrix. Both clusters have zero mean. Their covariance
matrices are as follows.

The total number of data samples is 2048, and each cluster contains
1024 data samples. There are two database hosts: the first database
host can only observe the 1024 data samples from the first cluster,
and the second database host can only observe the 1024 data samples
from the second cluster. In the formulated optimization problem, a
term $\sigma^2 I_2$, $\sigma^2 = 0.5$, is added to the objective
function as in Eqn. (4). The clustering results for 256 randomly
selected data samples are shown in the figure. The percentage of
misclassified data samples is 1.71%. The results for the cases where
each database host observes a mixture of data samples from the two
clusters with various percentages are not significantly different
from the result in the figure. The proposed clustering algorithm does
not have any numerical or convergence difficulties for these cases.
Fig. 3. Clustering result for the case that one cluster has a singular covariance
matrix.

In Fig. 4, we depict the result of the proposed algorithm for the
case of two clusters with different means. The first cluster has zero
mean and covariance matrix

The second cluster has mean $[800, 800]^T$ and covariance matrix

The total number of data samples is 2048, and each cluster contains
1024 data samples. There are two database hosts: the first database
host can only observe the 1024 data samples from the first cluster,
and the second database host can only observe the 1024 data samples
from the second cluster. The percentage of misclassified data samples
is 2.29%. The results for the cases where each database host observes
a mixture of data samples from the two clusters with various
percentages are not significantly different from the result in the
figure.
Fig. 4. Clustering result for the case that the two clusters have different
means.



In summary, we find that the proposed clustering algorithm has a low
misclassification rate (between 1.71% and 5.32% in the above
experiments) and converges quickly. The algorithm does not have
numerical or convergence difficulties for the case of singular
covariance matrices. The proposed algorithm is a promising approach
for future large-scale data analysis.



V. Conclusion 

This paper proposes a large-scale data clustering algorithm based on
distributed optimization. We show that the duality gap of the
considered optimization problem goes to zero as the problem size goes
to infinity. Therefore, the global optimization problem can be
decomposed into small-scale sub-optimization problems by using the
Dantzig-Wolfe method. The small-scale sub-optimization problems can
be solved using a group of computers coordinated by one center
processor. Numerical results show that the proposed algorithm is
effective and efficient, and does not have numerical or convergence
difficulties.